Search in the Catalogues and Directories

Hits 1 – 20 of 174

1
Abstracts from the KAS corpus KAS-Abs 2.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
BASE
2
Corpus of 1968 Slovenian literature Maj68 2.0
BASE
3
Corpus of academic Slovene KAS 2.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko; Erjavec, Tomaž; Fišer, Darja; Ljubešić, Nikola; Ferme, Marko; Borovič, Mladen; Boškovič, Borko; Ojsteršek, Milan; Hrovat, Goran. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
Abstract: The KAS corpus of Slovene academic writing consists of almost 65,000 BSc/BA, 16,000 MSc/MA and 1,600 PhD theses (82 thousand texts, 5 million pages or 1.5 billion tokens) written between 2000 and 2018 and gathered from the digital libraries of Slovene higher education institutions via the Slovene Open Science portal (http://openscience.si/). The theses have significant metadata associated with them, while each thesis in the corpus contains only its textual body, i.e. without front and back matter. The body is divided into chapters, then into pages, these into paragraphs, and then into sentences. The sentence tokens are tagged with morphosyntactic descriptions (detailed part-of-speech tags) and the words are lemmatised. In contrast to the previous version 1.0, the KAS corpus of Slovene academic writing 2.0 is cleaner and contains segmentation into chapters. The metadata also contains more information about the research fields of each work. Both versions consist of the same number of BSc/BA, MSc/MA, and PhD theses; however, the processing was done from scratch for 2.0, so the numbers of e.g. pages and tokens differ. Note also that the new version does not contain links to the PNG images of individual pages, nor does it contain annotated terms, both of which are present in version 1.0. Unlike 1.0, it is also not mounted on the CLARIN.SI concordancers. The new version is distributed in the canonical TEI encoding, JSON, and as plain text files. In the TEI format, chapter names are denoted with the tag. Each entry in the JSON files has a string ID and a list containing the names of chapters as its first element and the texts as its second element. Chapters without text are represented as an empty string. The plain text files contain only text bodies without segmentation information. References: Žagar, A., Kavaš, M., & Robnik Šikonja, M. (2021). Corpus KAS 2.0: cleaner and with new datasets. In Information Society - IS 2021: Proceedings of the 24th International Multiconference.
https://doi.org/10.5281/zenodo.5562228
Keyword: academic writing; BSc/BA theses; MSc/MA theses; PhD theses; TEI
URL: http://hdl.handle.net/11356/1448
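The JSON layout described in the abstract could be read along the following lines. This is a minimal sketch, not the distribution's documented loader: it assumes each file is a single JSON object that maps a thesis ID to a two-element list (chapter names, then chapter texts). The ID "kas-0001", the chapter names, and the helper iter_chapters are hypothetical and used only for illustration.

```python
import json


def iter_chapters(entries):
    """Yield (thesis_id, chapter_name, chapter_text) triples.

    Assumes `entries` maps a string ID to [chapter_names, chapter_texts],
    as the abstract describes; the real files may be laid out differently.
    """
    for thesis_id, (names, texts) in entries.items():
        for name, text in zip(names, texts):
            yield thesis_id, name, text


# Hypothetical sample entry; the third chapter has no text (empty string).
sample = json.loads("""
{
  "kas-0001": [["Uvod", "Metode", "Priloge"],
               ["Besedilo uvoda.", "Opis metod.", ""]]
}
""")

# Keep only the chapters that actually carry text.
non_empty = [(tid, name) for tid, name, text in iter_chapters(sample) if text != ""]
print(non_empty)
```

Filtering on the empty string follows the abstract's note that chapters without text are represented as an empty string.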
BASE
4
Summarization datasets from the KAS corpus KAS-Sum 1.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
BASE
5
Machine Translation datasets from the KAS corpus KAS-MT 1.0
Žagar, Aleš; Kavaš, Matic; Robnik-Šikonja, Marko. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2022. : Faculty of Computer and Information Science, University of Ljubljana, 2022
BASE
6
Collection of Slovenian paremiological units Pregovori 1.0
Babič, Saša; Peče, Miha; Erjavec, Tomaž. - : ZRC SAZU, 2022. : Jožef Stefan Institute, 2022
BASE
7
The ParlaMint corpora of parliamentary proceedings
BASE
8
The ParlaMint corpora of parliamentary proceedings
In: Lang Resour Eval (2022)
BASE
9
Universal Dependencies 2.9
Zeman, Daniel; Nivre, Joakim; Abrams, Mitchell. - : Universal Dependencies Consortium, 2021
BASE
10
Universal Dependencies 2.8.1
Zeman, Daniel; Nivre, Joakim; Abrams, Mitchell. - : Universal Dependencies Consortium, 2021
BASE
11
Universal Dependencies 2.8
Zeman, Daniel; Nivre, Joakim; Abrams, Mitchell. - : Universal Dependencies Consortium, 2021
BASE
12
Spoken corpus Gos 1.1
Zwitter Vitez, Ana; Zemljarič Miklavčič, Jana; Krek, Simon. - : Centre for Language Resources and Technologies, University of Ljubljana, 2021
BASE
13
Corpus of 1968 Slovenian literature Maj68 1.0
BASE
14
Corpus of term-annotated texts RSDO5 1.1
BASE
15
Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.0
Ljubešić, Nikola; Fišer, Darja; Erjavec, Tomaž. - : Jožef Stefan Institute, 2021
BASE
16
Montenegrin web corpus meWaC 1.0
Ljubešić, Nikola; Erjavec, Tomaž. - : Jožef Stefan Institute, 2021
BASE
17
Comparable corpora of South-Slavic Wikipedias CLASSLA-Wikipedia 1.0
Ljubešić, Nikola; Markoski, Filip; Markoska, Elena. - : Jožef Stefan Institute, 2021
BASE
18
Training corpus ssj500k 2.3
Krek, Simon; Dobrovoljc, Kaja; Erjavec, Tomaž. - : Centre for Language Resources and Technologies, University of Ljubljana, 2021
BASE
19
Spoken corpus Gos VideoLectures 4.2 (transcription)
Verdonik, Darinka; Potočnik, Tomaž; Sepesy Maučec, Mirjam. - : Faculty of Electrical Engineering and Computer Science, University of Maribor, 2021
BASE
20
Multilingual comparable corpora of parliamentary debates ParlaMint 2.1
BASE
